Ensemble attack: Meta classifier pipeline #37
Conversation
* Add XGBoost training pipeline
* Add XGBoost and LR pipeline
* Metaclassifier training finalized
Really nice implementation of the blending++ pipeline! I also double-checked your implementation against the original ensemble codebase, and it matches perfectly 🙂.
I’ve left some comments, and I’ll address the ones where you’ve tagged me.
One suggestion (feel free to ignore): to clarify the different dataframes in this pipeline, in addition to docstrings, we could also reference the descriptions and diagrams in the example README.
examples/ensemble_attack/config.yaml (outdated)

```yaml
trans_json_file_path: ${base_example_dir}/data_configs/trans.json
population_sample_size: 40000

# Metadata for real data
```
Not sure if this is a good idea, but can we create a `trans_metadata.json` or `real_metadata.json` file to store this information? I'm suggesting this because we already have several data config files, like `trans_domain.json` and `info.json`, with similar metadata. We could keep `config.yaml` for attack-pipeline configurations only, set the path to the metadata JSON file here (similar to `trans_domain_file_path`), and load it in the `BlendingPlusPlus` init.
```python
log(INFO, "Data processing pipeline finished.")
if config.pipeline.run_data_processing:
    run_data_processing(config)
elif config.pipeline.run_metaclassifier_training:
```
We can probably just use another `if` here instead of `elif`, since someone might want to run data processing as well as meta classifier training.
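A runnable sketch of the difference, using stand-in stage functions and a `SimpleNamespace` config (both hypothetical): with independent `if` blocks, both stages run when both flags are set, which `elif` would not allow.

```python
from types import SimpleNamespace


def run_data_processing(config):  # stand-in for the real stage
    config.ran.append("data_processing")


def run_metaclassifier_training(config):  # stand-in for the real stage
    config.ran.append("metaclassifier_training")


config = SimpleNamespace(
    pipeline=SimpleNamespace(run_data_processing=True, run_metaclassifier_training=True),
    ran=[],
)

# Independent `if` blocks: both stages execute when both flags are set.
if config.pipeline.run_data_processing:
    run_data_processing(config)
if config.pipeline.run_metaclassifier_training:
    run_metaclassifier_training(config)
```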
```python
    Path(config.data_paths.processed_attack_data_path) / "master_challenge_test_labels.npy",
)

df_synth = load_dataframe(
```
Is this the synthetic data generated by the TabDDPM model trained on the `df_real_train` data saved at `Path(processed_attack_data_path, "real_train.csv")`? Since we still haven't added the code for that, a comment saying where this synthetic data comes from would be helpful.
For now, this synthetic data has come from the original repository because I just wanted to test that it would run. I did the same with the RMIA signals file, only using them as placeholders. I'll add the comment for future reference, though. Thanks!
```python
# 3. Get RMIA signals (placeholder)
rmia_signals = pd.read_csv(
    "examples/ensemble_attack/data/attack_data/og_rmia_train_meta_pred.csv"
```
I know this is temporary because we still don't have the RMIA computation pipeline, but can we get this path from `config.yaml` instead of hard-coding it here?
I'd rather keep it like this for now because I'm not passing config.yaml to blending.py and don't plan on doing it. It will be resolved soon though!
Got it! Makes sense 👍
```python
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(n_startup_trials=10, seed=np.random.randint(1000)),
```
I see this is the random seed used in the original code, but as David suggested, we could make it an option for the class and pass in the user-specified `random_seed` here (it could be the one the user sets in `config.yaml`).
```python
colsample_bytree=trial.suggest_float("colsample_bylevel", 0.5, 1),
reg_alpha=trial.suggest_categorical("reg_alpha", [0, 0.1, 0.5, 1, 5, 10]),
reg_lambda=trial.suggest_categorical("reg_lambda", [0, 0.1, 0.5, 1, 5, 10, 100]),
tree_method="auto",
```
Out of curiosity, was there an issue with `tree_method="auto" if not use_gpu else "gpu_hist"`?
Nope. I wanted it to run on my local machine without taking up GPU space. Should've just changed `use_gpu` to false in `config.yaml` :)
.gitignore (outdated)

```
# Trained metaclassifiers
examples/ensemble_attack/trained_models/
examples/ensemble_attack/attack_results/
examples/ensemble_attack_example/
```
I think we don't have this last one anymore
examples/ensemble_attack/config.yaml (outdated)

```yaml
"categorical": ["trans_type", "operation", "k_symbol", "bank"]
"variable_to_predict": "trans_type"

col_type:
```
I think I am missing where col_type and bounds are used. Maybe adding some comments here would be helpful.
In the original codebase, `bounds` are used to preprocess categorical columns when training the XGBoost model, while the meta classifier features don't include them at all. I added them just in case but didn't use them. As for `col_type`, you're right, it hasn't been used yet, and I'm not sure it ever will be. I'll remove both for now and add them back as needed.
Great stuff, thanks for addressing the comments! Approving with just some minor ones now.
As others have commented in this PR as well, please wait for their approval or make sure to address all their comments.
examples/ensemble_attack/config.yaml (outdated)

```diff
 "clavaddpm_black_box",
 "clavaddpm_white_box",
-]
+ ]
```
nitpick: no need to add this space here, it messes up the indentation.
```python
process_split_data(
    all_population_data=population_data,
    processed_attack_data_path=Path(config.data_paths.processed_attack_data_path),
    # TODO: column_to_stratify value is not documented in the original codebase.
```
I believe this is not true anymore? I see docstrings in that function right now.
I think by "original codebase", @fatemetkl meant the submission repository (link), but I'm not sure what the "TODO" is for. My guess is we test things with stratified columns specified.
Hi! The "original codebase" in this comment refers to the attack submission; I should have added a link, sorry for the confusion! I will fix it in my PR. As Sara mentioned, since this parameter wasn't documented in the original attack codebase, I added a TODO to experiment with other columns in case the one I specified isn't the correct one.
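For context, stratifying on a column keeps its class proportions identical across the resulting splits, which is what a `column_to_stratify` choice is meant to guarantee. A small pandas-only sketch (the column values here are made up):

```python
import pandas as pd

# Hypothetical 80/20 class imbalance in a trans_type column.
df = pd.DataFrame({"value": range(100), "trans_type": ["A"] * 80 + ["B"] * 20})

# Sample half of each group so the 80/20 ratio survives in both halves.
train = df.groupby("trans_type", group_keys=False).sample(frac=0.5, random_state=0)
test = df.drop(train.index)
```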
Thank you for all the edits and work towards addressing the comments. Most of my previous comments have been addressed really nicely. I have a few additional small ones, and I left a few of mine unresolved, as I think a few small changes remain before they are fully addressed. Let me know if you have questions on any comments!
```diff
@@ -1,70 +1,96 @@
-# Blending++ orchestrator, equivalent to blending_plus_plus.py in the submission repo
+"""Blending++ orchestrator, equivalent to blending_plus_plus.py in the submission repo."""
```
Rather than just referring to the "submission repo" I would add a link to said repo here if possible.
```python
    lr_model = LogisticRegression(max_iter=1000)
    self.meta_classifier_ = lr_model.fit(meta_features, y_train)

else:
```
I think it's okay to keep this error if something weird happens (i.e. we expand the enum but forget to update this).
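The pattern being defended, sketched with a hypothetical enum name: the trailing `raise` fails loudly if the enum gains a member that the dispatch forgets to handle.

```python
from enum import Enum


class MetaClassifierType(Enum):
    # Hypothetical enum for illustration; the real one lives in the toolkit.
    XGBOOST = "xgboost"
    LOGISTIC_REGRESSION = "logistic_regression"


def build_meta_classifier(kind: MetaClassifierType) -> str:
    if kind == MetaClassifierType.XGBOOST:
        return "xgboost model"
    if kind == MetaClassifierType.LOGISTIC_REGRESSION:
        return "logistic regression model"
    # Defensive branch: reached only if the enum is expanded but this
    # dispatch is not updated.
    raise ValueError(f"Unsupported meta classifier type: {kind}")
```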
```python
from midst_toolkit.attacks.ensemble.train_utils import get_tpr_at_fpr


optuna.logging.set_verbosity(optuna.logging.WARNING)
```
Maybe just add a comment here why you're setting this?
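For reference, `optuna.logging.set_verbosity(optuna.logging.WARNING)` roughly amounts to raising the level of Optuna's library logger, which suppresses the per-trial INFO lines printed during a study. A stdlib-only equivalent sketch:

```python
import logging

# Roughly what optuna.logging.set_verbosity(optuna.logging.WARNING) does:
# only warnings and errors from Optuna's logger reach the console, so the
# study's per-trial INFO messages no longer flood the output.
logging.getLogger("optuna").setLevel(logging.WARNING)
```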
```python
def __init__(
    self,
    input_features: pd.DataFrame,
    label: np.ndarray,
```
super minor, maybe call this labels instead of label?
```python
("preprocessing", preprocessing),
(
    "xgboost",
    xgb.XGBClassifier(
```
I think this comment remains relevant? We have a `use_gpu` argument, but it isn't making its way into the class?
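A sketch of threading the flag through (the helper name is illustrative; note that newer XGBoost versions prefer `device="cuda"` with `tree_method="hist"` over `"gpu_hist"`):

```python
def xgb_param_overrides(use_gpu: bool) -> dict:
    """Hypothetical helper: merge these into the XGBClassifier kwargs
    instead of hard-coding tree_method="auto"."""
    return {"tree_method": "gpu_hist" if use_gpu else "auto"}
```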
```python
def _evaluate_pipeline_cv(self, trial: Trial, num_kfolds: int) -> float:
    """
    Performs cross-validation on the pipeline and returns the mean TPR at a fixed FPR.
```
Currently, this is doing it at a specific FPR threshold (0.1) and it isn't configurable. So I think the docs need to mention this here, and in the return value.
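For reference, an illustrative stand-in for `get_tpr_at_fpr` (not the toolkit's implementation): the highest TPR achievable by any score cutoff whose FPR stays at or below the fixed threshold, 0.1 here by default.

```python
import numpy as np


def tpr_at_fpr(y_true: np.ndarray, y_score: np.ndarray, fpr_threshold: float = 0.1) -> float:
    """Best TPR over all score cutoffs whose FPR is <= fpr_threshold."""
    positives = (y_true == 1).sum()
    negatives = (y_true == 0).sum()
    best_tpr = 0.0
    for cutoff in np.unique(y_score):
        preds = y_score >= cutoff
        fpr = (preds & (y_true == 0)).sum() / negatives
        if fpr <= fpr_threshold:
            tpr = (preds & (y_true == 1)).sum() / positives
            best_tpr = max(best_tpr, tpr)
    return float(best_tpr)
```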
```python
    Path(config.data_paths.processed_attack_data_path) / "master_challenge_test_labels.npy",
)

# Synthetic data borrowed from the attack implementation repository.
```
I'd add a link to where this was borrowed from.
Thank you for addressing the comments. Most of my comments are addressed, except for a few minor ones about the attack example config, the DOMIAS documentation, and the extra line in `.gitignore`, I believe. Happy to discuss 🙂.
Looks great to me! 🚀
c83ecea Trainer: Refactoring process_pipeline_data function (#54)
2be0f81 Bump astral-sh/setup-uv from 6.7.0 to 6.8.0 (#56)
004fc67 Trainer: general variable renamings on data_loaders.py and adding a couple more enums (#53)
4c2d75f Trainer: Refactoring the _pair_clustering_keep_id (#52)
d8fc981 Ensemble attack: Meta classifier pipeline (#37)

git-subtree-dir: deps/midst-toolkit
git-subtree-split: c83ecea
Short Description
Clickup Ticket(s): (https://app.clickup.com/t/868fm2hg6)
Both choices of meta classifier can be trained and saved as pickle files. Predictions can also be made and evaluated using TPR@FPR. The PR also includes preprocessing methods for computing training features (e.g., Gower distance, the DOMIAS method).
Things to note:
Tests Added
Tests have been added for appropriate model fitting, data preprocessing, and the prediction function.